Recent advances in the field of statistical learning have established that learners are
able to track regularities of multimodal stimuli, yet it is unknown whether the statistical 
computations are performed on integrated representations or on separate, unimodal 
representations. In the present study, we investigated the ability of adults to integrate audio 
and visual input during statistical learning. We presented learners with a speech stream 
synchronized with a video of a speaker's face. In the critical condition, the visual (e.g., 
/gi/) and auditory (e.g., /mi/) signals were occasionally incongruent, which we predicted 
would produce the McGurk illusion, resulting in the perception of an audiovisual syllable 
(e.g., /ni/). In this way, we used the McGurk illusion to manipulate the underlying statistical 
structure of the speech streams, such that perception of these illusory syllables facilitated 
participants' ability to segment the speech stream. Our results therefore demonstrate that 
participants can integrate audio and visual input to perceive the McGurk illusion during 
statistical learning. We interpret our findings as support for modality-interactive accounts 
of statistical learning. 
INTRODUCTION

Over the last 15 years, a growing body of research has detailed 
language learners' ability to extract statistical regularities from 
speech (hereafter statistical learning), particularly in the domain 
of speech segmentation. Many studies of statistical learning 
have examined this ability in the context of a single input 
modality, including auditory (Saffran et al., 1996, 1999), visual
(Fiser and Aslin, 2002), and tactile stimuli (Conway and Chris- 
tiansen, 2005). However, since the learning environment is 
typically multimodal (Stein and Stanford, 2008), perceptual 
mechanisms may be tuned to operate optimally over multi- 
modal input, suggesting that unimodal indices of perceptual 
learning could underestimate their capacity (Shams and Seitz, 
2008). Consequently, there has been a recent increase in research 
investigating how statistical learning mechanisms track multimodal
input (e.g., Sell and Kaschak, 2009; Cunillera et al., 2010;
Mitchel and Weiss, 2010, 2013; Thiessen, 2010). Numerous
studies have demonstrated that adults are capable of successfully
tracking multiple statistical inputs simultaneously in separate
modalities (Conway and Christiansen, 2006; Seitz et al.,
2007; Emberson et al., 2011; Mitchel and Weiss, 2011), though
the underlying processes remain unclear. When learning from 
multimodal input, do learners develop independent unimodal 
representations, a single multimodal representation, or some 
combination of the two? In the present study, we investigate
this issue by exploring the influence of the McGurk illusion
(a well-attested demonstration of audiovisual integration;
McGurk and MacDonald, 1976) on multimodal statistical
learning.

In one of the initial studies on multimodal statistical learning, 
Seitz et al. (2007) simultaneously presented participants with an
audio stream (non-tonal noises) and a visual stream (2-D shapes). 
At test, participants were able to correctly identify statistically 
defined audio, visual, and audiovisual bigrams that had appeared 
in the familiarization stream, demonstrating that learners are 
able to extract multiple, concurrent statistical patterns across 
sensory modalities. Moreover, Seitz et al. did not observe dispar- 
ities in performance when the streams were presented together 
or in isolation. Therefore, the authors concluded that statisti- 
cal learning in one modality is processed independently from 
input in another modality. In contrast, a more recent study has 
provided evidence of cross-modal effects during multimodal statistical
learning that are inconsistent with modality independence
(Mitchel and Weiss, 2011). In this study, adult learners were 
able to segment visual and auditory (tone) sequences simulta- 
neously when triplet boundaries across streams were in-phase, 
replicating the findings of Seitz and colleagues. However, learning 
was disrupted when the streams were offset such that the triplet 
boundaries across modalities were misaligned. This decrement 
in performance suggests that statistical learning of multimodal
input is subject to cross-modal interference, as the relationship
of boundary information between streams influenced participants' 
ability to segment each stream (Mitchel and Weiss, 2011). We pro- 
posed that statistical learning may be governed by an interactive 
network of modality-specific mechanisms. In this view, learning
is constrained by the modality of the input (see Conway and
Christiansen, 2005, 2006) while cross-modal effects operate via
associative links between mechanisms (Mitchel and Weiss, 2010,
2011; Emberson et al., 2011; Glicksohn and Cohen, 2013; see also
Cunillera et al., 2010).

While the aforementioned studies provide evidence that statis- 
tical learning mechanisms are capable of processing multimodal 
input, what is encoded from this input remains unclear. Specif- 
ically, when information from multiple modalities is available, 
are statistical computations performed on integrated, multimodal 
percepts or on unimodal representations? Although multimodal 
integration, or the coupling of two or more senses to produce a 
coherent multimodal representation, is a central property of per- 
ception (Shimojo and Shams, 2001), no study, to the best of our 
knowledge, has investigated this process in the context of statis- 
tical learning. A goal of the present study, then, is to investigate 
multimodal integration in statistical learning; specifically, we uti- 
lize the McGurk illusion to examine whether statistical learning of 
speech input operates on auditory input alone or on an integrated 
audiovisual representation. 

The McGurk illusion (McGurk and MacDonald, 1976) arises 
when incongruous visual information (e.g., lip movements) alters 
the auditory perception of speech. For example, one form of the 
McGurk illusion occurs when synchronously presented incongru- 
ent audio (e.g., /ba/) and visual (e.g., /ga/) syllables are integrated 
to be perceived as fused syllables (e.g., /da/). The McGurk illusion 
is widely regarded as a compelling behavioral index of audiovi- 
sual integration (e.g., Green, 1998; Massaro, 1998; Brancazio and 
Miller, 2005). Here, we test how auditory statistical learning may 
be influenced by the perceived audiovisual syllables resulting from 
the McGurk illusion. 

In the present study, we expose learners to a miniature arti- 
ficial language that provides no transitional probability cues to 
word boundaries. We pair the language with a synchronous
video of a speaker's face in three conditions. In the Audio-only 
condition, the artificial speech stream is presented alone. In the 
Audiovisual Consistent condition, the speech stream is paired 
with a talking face display that perfectly matches the speech syl- 
lables. In the Audiovisual Inconsistent condition, inconsistencies 
between select auditory syllables and visual articulatory gestures 
are used to elicit a McGurk illusion that could alter the statistical 
structure of the artificial language. In this altered structure, the 
transitional probabilities should cue word boundaries, such that 
syllable-to-syllable transitional probabilities within words (0.50) 
should be greater than transitional probabilities between words 
(0.25). Thus, if learners compute transitional probabilities using 
an integrated percept, then the changes in the statistical struc- 
ture of the language in the Audiovisual Inconsistent condition 
should enhance learning relative to the Audio-only or Audiovisual 
Consistent conditions. 

MATERIALS AND METHODS 
PARTICIPANTS 

One hundred forty-four (98 female, 46 male) participants from
Pennsylvania State University were included in the analyses.
Eleven additional participants (7%) were excluded from analysis
for failing to follow directions (7), such as falling asleep or
removing headphones, or because of technical errors during the
experiment (4).

STIMULI 

The auditory stimuli consisted of an artificial language with four 
tri-syllabic (CV.CV.CV) words (see Table 1). Six consonants and 
six vowels were combined to form a total of six CV syllables. Each 
syllable was created by synthesizing natural speech syllables and 
removing any acoustic cues to word boundaries in a similar man- 
ner as described in previous statistical learning experiments (see 
Weiss et al., 2009, 2010; Mitchel and Weiss, 2010). We recorded a
male speaker producing CVC syllables, with the final consonant 
being one of three possible places of articulation (bilabial, alve- 
olar, or velar). Coda consonants were recorded to preserve the 
co-articulatory vowel-to-consonant transitions when the CV syl- 
lables were later concatenated into trisyllabic words. Each CVC 
syllable was then hand-edited in Praat, removing the coda conso- 
nants and equating vowel duration. The syllables were synthesized 
in Praat, overlaying the same pitch (f0) contour onto each syllable
in order to remove any pitch or stress cues to segmentation and 
then concatenated to form the words. 

The four words were concatenated into a continuous stream 
in a pseudo-random order, such that each word appeared an 
equal number of times and no word ever followed itself. The 
artificial language had flat transitional probabilities within and 
between words (0.50 -> 0.50 -> 0.50; see Table 1). Without statistical
cues to word boundaries, it was predicted that this language
should not be learned in the Audio-only or Audiovisual Consis- 
tent conditions. In addition, the order of words in the stream 
was constrained such that words 1 and 2 were only followed by 
words 3 and 4, and vice versa. In the Audiovisual Inconsistent 
condition, this order constraint allowed the McGurk illusion (if 
perceived) to alter the statistical structure of the entire language 
while only manipulating two word-final syllables. Specifically, per- 
ception of the McGurk syllables would alter the syllable inventory 
across which transitional probabilities were calculated. In the 
Audiovisual Inconsistent condition, the new, integrated syllable 
inventory would provide robust statistical word boundary cues 
(0.50 -> 0.50 -> 0.25; see Table 1); thus, it was predicted that
learning should occur in the Audiovisual Inconsistent condition if 
participants perceived the integrated, illusory syllables. The speech 
stream consisted of three 4-min blocks for a total familiarization
of 12 min. Between blocks there was a 1-min silence
during which the screen turned white.
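
The contrast between the two statistical structures can be checked with a short calculation. The following Python sketch is illustrative only: it assumes a hypothetical, fully balanced word order (`WORD_ORDER`) that respects the constraint that words 1 and 2 are followed only by words 3 and 4 (and vice versa), and it treats the stream as circular so that every transition is counted. The actual familiarization stream was pseudo-random, so this is not the stream used in the experiment.

```python
from collections import Counter

# Word inventories taken from Table 1 (syllable spellings as in the text).
AUDIO_WORDS = {1: ("so", "bae", "ta"), 2: ("je", "lu", "mi"),
               3: ("bae", "je", "pa"), 4: ("lu", "so", "ni")}
MCGURK_WORDS = {1: ("so", "bae", "ta"), 2: ("je", "lu", "ni"),
                3: ("bae", "je", "ta"), 4: ("lu", "so", "ni")}

# Hypothetical balanced order (an assumption): each allowed word-to-word
# transition (1->3, 1->4, 2->3, 2->4, 3->1, 3->2, 4->1, 4->2) occurs once
# when the sequence is read as a loop.
WORD_ORDER = [1, 3, 1, 4, 2, 3, 2, 4]


def transitional_probabilities(words, order):
    """Return P(next syllable | current syllable) over a circular stream."""
    stream = [syllable for w in order for syllable in words[w]]
    n = len(stream)
    pair_counts = Counter((stream[i], stream[(i + 1) % n]) for i in range(n))
    syllable_counts = Counter(stream)
    return {pair: count / syllable_counts[pair[0]]
            for pair, count in pair_counts.items()}


def tp_profile(words, order):
    """Collect the TP values observed at each position:
    syllable 1 -> 2, syllable 2 -> 3, and word-final -> next word onset."""
    tps = transitional_probabilities(words, order)
    profile = [set(), set(), set()]
    for i, w in enumerate(order):
        s1, s2, s3 = words[w]
        next_onset = words[order[(i + 1) % len(order)]][0]
        profile[0].add(tps[(s1, s2)])
        profile[1].add(tps[(s2, s3)])
        profile[2].add(tps[(s3, next_onset)])
    return profile


print(tp_profile(AUDIO_WORDS, WORD_ORDER))   # flat structure
print(tp_profile(MCGURK_WORDS, WORD_ORDER))  # dip at the word boundary
```

Under these assumptions, the audio inventory yields 0.5 at every position, whereas the integrated (McGurk) inventory yields 0.5 within words and 0.25 across word boundaries, because /ta/ and /ni/ each become the final syllable of two words.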

For the visual displays, a Sony Handycam was used to video-record
an assistant lip-synching to an audio stream while reading
from a list of words mounted behind the camera (see Mitchel
and Weiss, 2010, 2013). The video was then hand-edited in Adobe
Premiere® to ensure that the audio stream and video display were
synchronous, aligning them such that the articulatory gestures
of the lips coincided with the corresponding auditory event. The
video was cropped to display only the lips of the actor, and then
exported as a QuickTime movie. The content of the consistent
visual display was the same as the audio stream. The content of
the inconsistent visual stream, however, differed from the audio
stream in two word-final syllables (audio: /mi/ and /pa/, visual:
/gi/ and /ka/, respectively). If these inconsistent audio and visual
syllables were integrated, then participants should have perceived a
McGurk illusion of /ni/ and /ta/ (MacDonald and McGurk, 1978).

Table 1 | Design of artificial language across display conditions.

        Audio-only           Audiovisual consistent   Audiovisual inconsistent (McGurk)

Words   so bae ta            so bae ta                so bae ta
        je lu mi             je lu mi                 je lu ni
        bae je pa            bae je pa                bae je ta
        lu so ni             lu so ni                 lu so ni

TPs     0.5 -> 0.5 -> 0.5    0.5 -> 0.5 -> 0.5        0.5 -> 0.5 -> 0.25

The word-final syllables of /je lu ni/ and /bae je ta/ in the Audiovisual Inconsistent condition (bolded in the original table) represent the illusory, integrated McGurk syllables. Syllable-to-syllable transitional probabilities (TPs) are
reported for each condition. TPs in the audiovisual conditions reflect the statistical structures of the languages if they include the integrated percept.

Learning of the statistically defined words was tested using an 
audio-only, 24-item word-identification task. The same test was
given for each condition and consisted of six words, three part- 
words, and three non-words, with each item presented twice in 
a randomized order. The six words were sub-divided into three 
classes with two words each: audiovisual, audio-only, and McGurk. 
Audiovisual test items were always consistent across audio and 
visual input during familiarization (/so bae ta/, /lu so ni/, see 
Table 1). Audio-only test items were taken from the audio stream 
(/je lu mi/, /bae je pa/), and should have been heard by participants 
if they did not perceive the McGurk illusion during familiar- 
ization. McGurk test items were the auditory equivalent of the 
illusory words that participants in the Audiovisual Inconsistent 
condition should have perceived if the McGurk illusion produced 
a fused, integrated percept (/je lu ni/, /bae je ta/). Non-words were 
combinations of syllables that did not occur together during famil- 
iarization, but conserved positional information (e.g., words with 
syllables ABC and DEF could form non-words AEF or DBC). Part- 
words were formed by combining the third syllable of one word 
with the first and second syllables of another word (e.g., ABC and 
DEF yield part-words CDE and FAB). 
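
The two foil-construction rules can be stated compactly in code. The sketch below uses the abstract ABC/DEF example from the text rather than the actual test items, which are not listed here; the function names are hypothetical.

```python
def make_non_words(word_a, word_b):
    """Recombine syllables across two words while keeping each syllable in
    its original position (ABC + DEF -> AEF, DBC)."""
    return [(word_a[0], word_b[1], word_b[2]),
            (word_b[0], word_a[1], word_a[2])]


def make_part_words(word_a, word_b):
    """Join the third syllable of one word to the first two syllables of
    the other, spanning a word boundary (ABC + DEF -> CDE, FAB)."""
    return [(word_a[2], word_b[0], word_b[1]),
            (word_b[2], word_a[0], word_a[1])]


abc, def_ = ("A", "B", "C"), ("D", "E", "F")
print(make_non_words(abc, def_))   # [('A', 'E', 'F'), ('D', 'B', 'C')]
print(make_part_words(abc, def_))  # [('C', 'D', 'E'), ('F', 'A', 'B')]
```

Part-words therefore contain syllable sequences that did occur in the stream (across a boundary), whereas non-words contain sequences that never occurred together.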

PROCEDURE 

Participants in all conditions provided written informed consent, 
and the protocol used in this experiment was approved by the 
Office of Research Protections at The Pennsylvania State University 
(IRB protocol #16986). 

In the Audio-only condition, participants were instructed to 
listen to an audio stream and informed they would be tested on 
knowledge acquired from this familiarization. Participants were 
not informed that the audio stream was an artificial language. 
The familiarization stream and test were presented using E-Prime
software. At test, participants were asked to judge whether
each test item was a word, based on the preceding familiarization
stream, by pressing the keys marked "yes" or "no" on a keyboard.

In the two audiovisual conditions, participants were instructed 
to view a short movie and informed that they would be tested 
following the movie. There were no explicit instructions given
about the nature of the movie, nor were participants informed
that the audio stream was composed of an artificial language. 
Familiarization streams were presented using iTunes (version 
7.0) software. Following familiarization, participants completed 
the same identification test as in the Audio-only condition, presented
using E-Prime software. There was no video display during
the test.

ANALYSIS 

Using signal detection theory, d' (standardized hit rate minus
standardized false alarm rate) was
calculated to determine participants' sensitivity to detecting words.
Since endorsement of McGurk and Audio-only word items could
be categorized as either hits or false alarms depending on the
condition, we elected to define hits as endorsement of audiovisual-consistent
word items and defined false alarms as endorsement of
non-words (which never occurred during familiarization, providing
an accurate index of false alarm rate). Thus, d' was calculated
by subtracting the standardized endorsement rate for non-words
from the standardized endorsement rate for audiovisual words:

d' = z[P("yes" | audiovisual words)] - z[P("yes" | non-words)].

In this task, a d' of 0 represents chance performance (participants
were equally likely to endorse words and non-words), while a d'
significantly above 0 represents learning (participants were more
likely to endorse words than non-words). In order to assess the
learning of the McGurk words, we compared endorsement rates 
(the probability that a participant would choose "yes" for an item) 
across display conditions. 
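
As a minimal sketch of the d' computation defined above, the following Python function standardizes the two endorsement rates with the inverse normal CDF. The trial counts follow from the test design (two audiovisual words and three non-words, each presented twice). The log-linear correction for endorsement rates of 0 or 1 is an assumption added here so that the z-transform is defined; the correction actually applied, if any, is not reported in this section.

```python
from statistics import NormalDist


def d_prime(hits, n_signal, false_alarms, n_noise):
    """d' = z(hit rate) - z(false alarm rate).

    hits / n_signal: "yes" responses to audiovisual words / number of such trials
    false_alarms / n_noise: "yes" responses to non-words / number of such trials
    """
    # Assumed log-linear correction (not from the source): keeps both rates
    # strictly between 0 and 1, where the inverse normal CDF is finite.
    hit_rate = (hits + 0.5) / (n_signal + 1)
    fa_rate = (false_alarms + 0.5) / (n_noise + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


# 2 audiovisual words x 2 presentations = 4 signal trials;
# 3 non-words x 2 presentations = 6 noise trials.
print(d_prime(hits=2, n_signal=4, false_alarms=3, n_noise=6))  # chance-level responding
print(d_prime(hits=4, n_signal=4, false_alarms=0, n_noise=6))  # perfect discrimination
```

A participant who endorses words and non-words at the same rate obtains a d' of 0, and larger positive values indicate better discrimination of words from non-words.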

RESULTS 
d' ANALYSIS

All statistical tests were two-tailed. The mean d' score in the Audio-only
condition was 0.47 (SD = 2.31), a level of performance
that was not significantly above chance, t(47) = 1.39, p = 0.170,
Cohen's d = 0.20 (see Figure 1). The mean d' score in the Audiovisual
Consistent condition was -0.95, which was significantly
below chance, t(47) = -2.89, p = 0.006, Cohen's d = -0.42. The
mean d' score in the Audiovisual Inconsistent condition was 1.79
(SD = 2.47), which was significantly above chance, t(47) = 4.99,
p < 0.001, Cohen's d = 0.72. A one-way ANOVA found a significant
difference in d' scores across conditions, F(2,141) = 16.194,
MSE condition = 89.93, p < 0.001, ηp² = 0.187. A Bonferroni
post-hoc analysis revealed significant pairwise differences between
all three display conditions (all p's < 0.05).

FIGURE 1 | Mean d' across display types (x-axis: Display Condition). The horizontal line represents
chance performance (d' = 0). Error bars: +/- 1 SEM.
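
The relationship between the reported means, standard deviations, t statistics, and effect sizes can be illustrated with a short calculation. This sketch assumes 48 participants per condition (inferred from the 47 degrees of freedom) and works only from the rounded summary values reported above, so it reproduces the published t values approximately rather than exactly.

```python
import math


def one_sample_summary(mean, sd, n, chance=0.0):
    """One-sample t statistic and Cohen's d against a chance value,
    computed from summary statistics."""
    standard_error = sd / math.sqrt(n)
    t = (mean - chance) / standard_error
    cohens_d = (mean - chance) / sd
    return t, cohens_d


# Audio-only and Audiovisual Inconsistent conditions (n = 48 is an inference
# from the reported degrees of freedom).
for label, mean, sd in [("Audio-only", 0.47, 2.31),
                        ("Audiovisual Inconsistent", 1.79, 2.47)]:
    t, d = one_sample_summary(mean, sd, n=48)
    print(f"{label}: t(47) = {t:.2f}, Cohen's d = {d:.2f}")
```

Small discrepancies from the reported t values are expected because the inputs are rounded to two decimal places.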

ENDORSEMENT RATE ANALYSIS 

We report the endorsement rates for each type of test item in
Figure 2. We first compared endorsement rates across item type
and condition in a 5 (item type) × 3 (display) mixed-factor
repeated-measures ANOVA, where item type was a within-subjects
factor and display was a between-subjects factor. In this analysis, there
was a significant main effect of item type [F(4,564) = 4.53,
MSE = 0.21, p = 0.001, ηp² = 0.031], a significant main effect
of display condition [F(2,141) = 16.68, MSE = 1.42, p < 0.001,
ηp² = 0.191], and a significant interaction between item type and
display [F(8,564) = 11.63, MSE = 0.531, p < 0.001, ηp² = 0.142].
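
The effect sizes above can be recovered from the F statistics and their degrees of freedom, since partial eta squared equals SS_effect / (SS_effect + SS_error) = (F × df_effect) / (F × df_effect + df_error). The following sketch applies this identity to the three effects of the mixed ANOVA; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    effect = f_value * df_effect
    return effect / (effect + df_error)


effects = {
    "item type": (4.53, 4, 564),
    "display": (16.68, 2, 141),
    "item type x display": (11.63, 8, 564),
}
for name, (f_value, df1, df2) in effects.items():
    print(f"{name}: partial eta squared = {partial_eta_squared(f_value, df1, df2):.3f}")
```

Rounded to three decimals, these values agree with the effect sizes reported for the mixed ANOVA (0.031, 0.191, and 0.142).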

To further examine the interaction between display condition
and endorsement rates, we performed separate one-way ANOVAs
comparing endorsement rate across display conditions for each
of the five item types (see Figure 2). There were significant¹
main effects of condition on endorsement of the three "word"
test items: AV words, F(2,143) = 26.03, MSE = 1.58, p < 0.001,
ηp² = 0.270; Audio words, F(2,143) = 23.78, MSE = 1.17,
p < 0.001, ηp² = 0.252; McGurk words, F(2,143) = 10.32,
MSE = 0.66, p < 0.001, ηp² = 0.128. Subsequent linear contrast
analyses² on each of the three word items revealed that endorsement
rate was significantly greater in the Audiovisual Inconsistent
condition than in the Audio-only and Audiovisual Consistent conditions: AV
words, t(141) = 5.24, p < 0.001, Cohen's d = 0.88; Audio words,
t(141) = 2.97, p = 0.004, Cohen's d = 0.50; McGurk words,
t(141) = 3.70, p < 0.001, Cohen's d = 0.62. There were no significant
main effects of condition on endorsement of the two foil
test items: part-words, F(2,143) = 0.67, MSE = 0.04, p = 0.512,
ηp² = 0.033; non-words, F(2,143) = 2.42, MSE = 0.10, p = 0.093,
ηp² = 0.009. Since the omnibus ANOVAs were not significant,
contrast analyses were not conducted for these two item types.

¹ Significance level was adjusted for multiple comparisons. The corrected alpha for
five comparisons was 0.01.

² Contrast weights were -1, -1, 2 for Audio-only, Audiovisual Consistent, and
Audiovisual Inconsistent conditions, respectively.

DISCUSSION 

The goal of the present study was to test whether input from mul- 
tiple modalities could be integrated during statistical learning, 
utilizing the McGurk effect to manipulate the perceived statisti- 
cal structure of a speech stream. We presented learners with an 
artificial language in which word boundaries were not cued by 
transitional probabilities. The stream was either presented in isolation
(Audio-only condition) or synchronized with a visual display
that either matched the audio stream (Audiovisual Consistent con- 
dition) or was discrepant in two word-final syllables (Audiovisual 
Inconsistent condition), eliciting a McGurk illusion that altered 
the statistical structure by adding boundary information. 

The results of the present study support our prediction that the
McGurk illusion in the Audiovisual Inconsistent condition would
facilitate participants' ability to use statistical cues to segment
a continuous speech stream. In the Audio-only and Audiovisual
Consistent conditions, segmentation performance, as measured by d',
was not significantly above chance. In contrast, performance in 
the Audiovisual Inconsistent (i.e., McGurk) condition was above 
chance and was significantly greater than the Audio-only and 
Audiovisual Consistent conditions. In addition, the pattern of 
endorsement rates supports our conclusions from the d' analysis, 
as we found a significant effect of display condition on endorse- 
ment rates. In particular, participants were significantly more 
likely to endorse the AV word items in the Audiovisual Inconsistent 
condition. Because these items were consistent across the audio 
and visual input during familiarization, audiovisual endorsement 
rate is independent from participants' perception of the McGurk 
items. AV word endorsement rate therefore provides a measure of 
whether the McGurk illusion affected the global statistical structure
of the language. Taken together, the d' and endorsement
rate analyses demonstrate a significant increase in segmentation 
performance in the Audiovisual Inconsistent condition, suggest- 
ing that learners are capable of audiovisual integration during 
statistical learning. 

It is worth noting that performance in the Audiovisual Consistent
condition was significantly lower than in the Audio-only
condition, and this appears to be the result of systematically lower
endorsement of word items at test. This is a counter-intuitive
finding, as our a priori hypothesis was that performance would
be similar across the Audio-only and Audiovisual Consistent
conditions. Nonetheless, the goal of the Audiovisual Consistent
condition was to rule out the possibility that any enhancement
in performance in the Audiovisual Inconsistent condition was
merely due to the incorporation of a video display (e.g., through
increased attention; see Toro et al., 2005). Since learning was significantly
greater in the Audiovisual Inconsistent condition than
in either the Audio-only or Audiovisual Consistent conditions, we 
can conclude that this facilitation of learning in the Audiovisual 
Inconsistent condition was due to the integrated, illusory per- 
cept's enhancement of the transitional probability cues to word 
boundaries. 

FIGURE 2 | Mean endorsement rates (i.e., probability of saying "yes") for the different types of items across display conditions (Audio-only, Audiovisual Consistent, Audiovisual Inconsistent). Error bars: +/- 1 SEM.

To the best of our knowledge, our results provide the
first demonstration of multimodal integration during speech
segmentation via statistical learning. In the field of statistical learning,
as well as in research on language acquisition, there has been
growing support for the involvement of multiple sensory modal- 
ities in the learning process. For example, several studies have 
demonstrated a role for vision (e.g., facial movements) in statistical
learning (e.g., Sell and Kaschak, 2009; Cunillera et al., 2010;
Mitchel and Weiss, 2010; Thiessen, 2010; Van den Bos et al., 2012).
However, these studies have not addressed how cross-modal inte- 
gration may change the input landscape over which statistical 
learning takes place. Here, we have demonstrated that learners 
have the capacity to integrate multimodal input during statistical 
learning, altering the pattern of speech segmentation. 

While the results of the present study establish that the integra- 
tion of audiovisual information can alter statistical learning, our 
data do not delineate whether the stored representations include
the integrated percept (e.g., /ni/), the corresponding unimodal
percepts (e.g., audio /mi/ and visual /gi/), or both.
According to modality-specific theories of multisensory integration
(e.g., Bernstein et al., 2004), multimodal statistical learning
would result in the encoding of sensory-specific representations.
Alternatively, many common format theories of audiovisual integration
(e.g., Fowler, 2004; Summerfield, 1987; Rosenblum, 2005)
hold that each unimodal input is transformed³ into a singular
amodal signal with a "common currency" across sensory modalities.
Our data do not distinguish between these mechanisms of
multisensory integration, though future work may be able to adapt
our paradigm to directly test (e.g., with a two-alternative forced-choice
test) the relative availability of unimodal and multimodal
representations after familiarization.



³ It should be noted that not all common format theories propose the necessity of
transforming multimodal stimuli into a single representation. For example, Gibson's
(1969) invariant detection view proposes that amodal information is directly
available in sensory input, and therefore no translation is necessary.



The ability to integrate multimodal perceptual input is con- 
sistent with a modality-interactive view of statistical learning. 
Prior research on statistical learning in a multimodal environ- 
ment has identified modality-specific constraints on statistical 
learning (Conway and Christiansen, 2005; see also Conway and
Christiansen, 2006, 2009; Emberson et al., 2011). For example,
Conway and Christiansen (2005) observed quantitative advantages
in the auditory domain for extracting temporal regularities
relative to the tactile and visual domains. In addition, the authors
reported discrepancies in the kind of structure to which learners
were sensitive in each modality. Such modality constraints
suggest that statistical learning is governed by an array of modality-specific
mechanisms (in contrast to, e.g., Kirkham et al., 2002;
Thiessen, 2011). The present study, in concert with recent evidence
from multimodal statistical learning paradigms, demonstrates
a cross-modal effect during statistical learning. Thus, we have
suggested (Mitchel and Weiss, 2011; see also Emberson et al.,
2011) that while statistical learning may be governed by modality-specific
subsystems, these systems are linked within an interactive
network. We propose that associations across modalities produce
the cross-modal effects on learning observed in the current
study. This proposal is consistent with modality-specific theories
of multisensory integration (see Bernstein et al., 2004), which
propose that audiovisual speech perception results in separate,
modality-specific representations that become linked upstream in
processing. Furthermore, our proposal is consistent with recent
neuroimaging work revealing that sensory encoding employs a
distributed network of overlapping cortical regions across senses
(e.g., Ghazanfar and Schroeder, 2006; Liang et al., 2013; Okada
et al., 2013). For example, unimodal auditory input has been
shown to elicit a distinct pattern of neural activity in the primary
visual cortex, and vice versa (Liang et al., 2013). These findings
provide neural evidence of distinct yet associated processing
of sensory information across modalities, which is compatible
with the view of multisensory statistical learning posited
here.